Dataset Details

This dataset has been sourced from Kaggle.It contains marketing data of 2240 customers and includes 28 features on customer profiles, products purchased, campaign details and channel performance.

People

  1. ID: Customer’s unique identifier
  2. Year_Birth: Customer’s birth year
  3. Education: Customer’s education level
  4. Marital_Status: Customer’s marital status
  5. Income: Customer’s yearly household income
  6. Kidhome: Number of children in customer’s household
  7. Teenhome: Number of teenagers in customer’s household
  8. Dt_Customer: Date of customer’s enrollment with the company
  9. Recency: Number of days since customer’s last purchase
  10. Complain: 1 if customer complained in the last 2 years, 0 otherwise
  11. Country: Customer’s location

Products

  1. MntWines: Amount spent on wine in the last 2 years
  2. MntFruits: Amount spent on fruits in the last 2 years
  3. MntMeatProducts: Amount spent on meat in the last 2 years
  4. MntFishProducts: Amount spent on fish in the last 2 years
  5. MntSweetProducts: Amount spent on sweets in the last 2 years
  6. MntGoldProds: Amount spent on gold in the last 2 years

Place

  1. NumWebPurchases: Number of purchases made through the company’s web site.
  2. NumCatalogPurchases: Number of purchases made using a catalogue.
  3. NumStorePurchases: Number of purchases made directly in stores
  4. NumWebVisitsMonth: Number of visits to company’s web site in the last month

Promotion

  1. NumDealsPurchases: Number of purchases made with a discount
  2. AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
  3. AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
  4. AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
  5. AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
  6. AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
  7. Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
options(warn=-1)
library(knitr)
library(DT)

#reading the data
setwd("C:/Users/heena/OneDrive/Desktop/CS 544/Project")
#df <- read.csv(file.choose(),header = T)
df<-read.csv("marketing_data.csv")

#seeing first 6 rows of the data
head(df)

#checking dimensions of data
dim(df)

#feature names
colnames(df)

#checking column datatypes
sapply(df,class)
#kable(head(df),caption="Marketing Data",format="markdown")
datatable(df)

Objective

Our objective was to analyze the customer data and answer the following questions:

  1. How does the number of kids a customer has influence their expenditure?
  2. What is the effect of number of teenagers a customer has on how much they spend?
  3. Country wise expenditures.
  4. How much does each Country spend on products of various categories?
  5. Which is the most frequent education status and marital status category?
  6. Type of purchases (Web or In-store) based on education status.
  7. Type of purchases (Web or In-store) based on marital status.
  8. Is there a relation between income and expenditure?
  9. People of what age opt for the most number of deals?
  10. Influence of number of teens a customer has on web vs in store purchases.
  11. Influence of number a kids a customer has on web vs in store purchases.
  12. Types of purchases based on number of kids and number of teenagers.

Data Preprocessing

We converted the Income variable from character to numeric datatype. We customized the levels for Education variable and converted it to an ordered factor (High school < Under.graduate < Master < PhD). We also converted the columns Country,Marital status, Kidhome, Teenhome, whether the customer complained or not and whether the customer accepted any of the 5 campaigns to factor data type (unordered). Null values were found in the Education column and were removed. We created a function to check for outliers using the Inter Quartile Range criteria. Outliers were found in certain columns but we decided to keep them since they show important variability in the data.These values were quite close to each other and did not seem like data entry errors. The income and age columns were found to have some very extreme values, we found the row number for the same and created a subset of the data excluding those particular rows. A column was added for the customer’s age (calculated from Year of Birth). Expenditure on various products was added to give a new feature, total expenditure by every customer.

library("tidyverse")
library("plotly")
df$Income<-parse_number(df$Income)
df$Education[df$Education=='Basic']<-'High.school'
df$Education[df$Education=='Graduation']<-'Under.graduate'
df$Education[df$Education=='2n Cycle']<-'Master'
df$Education<-factor(df$Education,ordered=T,levels=c('High school','Under.graduate','Master','PhD'))
df$Marital_Status<-factor(df$Marital_Status)
df$Country<-factor(df$Country)
df$Kidhome<-factor(df$Kidhome)
df$Teenhome<-factor(df$Teenhome)
df$Complain<-factor(df$Complain)
df$Response<-factor(df$Response)
df$AcceptedCmp1<-factor(df$AcceptedCmp1)
df$AcceptedCmp2<-factor(df$AcceptedCmp2)
df$AcceptedCmp3<-factor(df$AcceptedCmp3)
df$AcceptedCmp4<-factor(df$AcceptedCmp4)
df$AcceptedCmp5<-factor(df$AcceptedCmp5)

colSums(is.na(df))
df2<-df[complete.cases(df),] #removing NAs
colSums(is.na(df2))

outliers<- function(column){
  lower.bound<-quantile(column,0.25) - 1.5*IQR(column)
  upper.bound<-quantile(column,0.75) + 1.5*IQR(column)
  return(column[column<lower.bound | column>upper.bound])
}
outliers(df2$Income)
which(grepl(666666, df2$Income))
#remove row 514 (specifying data.frame, row index)
df2 <- df2[-c(514),]

outliers(df2$MntWines)
outliers(df2$MntFishProducts)
outliers(df2$MntGoldProds)
outliers(df2$MntMeatProducts)
outliers(df2$MntSweetProducts)
outliers(df2$MntFruits)
outliers(df2$NumDealsPurchases)
outliers(df2$NumCatalogPurchases)
outliers(df2$NumStorePurchases)
outliers(df2$NumWebPurchases)

df2$Age<-2021-df2$Year_Birth
#checking outliers in age
outliers(df2$Age)
which(grepl(128, df2$Age))
which(grepl(122, df2$Age))
which(grepl(121, df2$Age))
df2 <- df2[-c(495,797,2149),]

df2$Total.expenditure<-df2$MntWines+df2$MntFruits+df2$MntGoldProds+df2$MntMeatProducts+df2$MntSweetProducts+df2$MntFishProducts
head(df2)

Number of kids and Total expenditure

Is there an influence of the number of kids a customer has on their total expenditure? The number of kids in a household often has a great impact on how much a customer spends. Prior knowledge of this can help companies come up with better and more customized marketing schemes. Kids represent an important demographic to marketers. In addition to their own purchasing power, they also influence their parents’ buying decisions and expenditure. Below is a boxplot and histogram for the distribution of total expenses based on the number of kids a customer has.

plot_ly (df2,x=df2$Total.expenditure[df2$Kidhome=='0'],y = ~df2$Kidhome[df2$Kidhome=='0'], type="box", name = '0 kids',color=I("greenyellow")) %>%
add_trace(x=df2$Total.expenditure[df2$Kidhome=='1'],y = ~df2$Kidhome[df2$Kidhome=='1'], name = '1 kid',color=I("orange")) %>%
add_trace(x=df2$Total.expenditure[df2$Kidhome=='2'],y = ~df2$Kidhome[df2$Kidhome=='2'],name = '2 kids',color=I("hotpink")) %>%
layout(xaxis = list(title = 'Total expenditure')) %>%
layout(yaxis = list(title = 'Kids'))%>%
layout(title = 'Boxplot for total expenditure based on number of kids at home',plot_bgcolor='#e5ecf6', 
       xaxis = list( 
         zerolinecolor = '#ffff', 
         zerolinewidth = 2,
         gridcolor = 'ffff',range=c(0,3000)), 
       yaxis = list( 
         zerolinecolor = '#ffff', 
         zerolinewidth = 2, 
         gridcolor = 'ffff')) -> p
p
df20<-df2[df2$Kidhome==0,]
df21<-df2[df2$Kidhome==1,]
df22<-df2[df2$Kidhome==2,]
fig1 <- plot_ly(x = df20$Total.expenditure, type = 'histogram',color=I("greenyellow"),name="0 kids",opacity=0.5)%>%
  layout(xaxis=list(range=c(0,3000)),yaxis=list(range=c(0,300)) )
fig2 <- plot_ly(x=df21$Total.expenditure, type ='histogram',color=I("orange"),name="1 kid",opacity=0.5) %>%
  layout(xaxis=list(range=c(0,3000)),yaxis=list(range=c(0,300)) )
fig3 <- plot_ly(x=df22$Total.expenditure, type='histogram',color=I("hotpink"),name="2 kids",opacity=0.5) %>%
  layout(xaxis=list(range=c(0,3000)),yaxis=list(range=c(0,300)) )
fig <- subplot(fig1, fig2, fig3, nrows = 3) %>% 
  layout(title = list(text = "Histograms for total expenditure based on number of kids at home"),
         plot_bgcolor='#e5ecf6' 
         ) 
fig
df2%>% group_by(Kidhome)%>% summarise(sd(Total.expenditure))
median(df2$Total.expenditure[df2$Kidhome=='0'])
mean(df2$Total.expenditure[df2$Kidhome=='0'])
df2%>% group_by(Kidhome)%>% summarise(sd(Total.expenditure))
median(df2$Total.expenditure[df2$Kidhome=='1'])
mean(df2$Total.expenditure[df2$Kidhome=='1'])
df2%>% group_by(Kidhome)%>% summarise(sd(Total.expenditure))
median(df2$Total.expenditure[df2$Kidhome=='2'])
mean(df2$Total.expenditure[df2$Kidhome=='2'])

Findings

  1. Customers with no kids were found to have the highest expenditure.
  2. They had a median expenditure of $914, as compared to a median expenditure of $73 and $62 by customers with 1 and 2 kids respectively.
  3. The variability in the data was also higher for customers with no kids.
  4. Their expenditure ranged from $0 to $2525 and had a standard deviation of $590, as opposed to $318 and $175 for customers with 1 and 2 kids respectively.
  5. Distribution of expenditure for customers with no kids was close to symmetric (mean very close to median), whereas distribution of expenditure for customers with 1 and 2 kids was right skewed (mean>median).
  6. Companies can focus their marketing schemes towards this demographic (Customers with no kids) in order to increase revenue.

Number of teenagers and Total expenditure

We try to explore the effect of the number of teenagers in a household on total expenditure. Marketing to young people is a strategy used by a wide range of companies. Knowing their most likely customers can help businesses better target their sales strategies by focusing on the majority demographic or offering discounts and offers to less likely subgroups.

plot_ly(df2,x=df2$Total.expenditure[df2$Teenhome=='0'],y = ~df2$Teenhome[df2$Teenhome=='0'], type="box", name = '0 teens',color=I("greenyellow")) %>%
  add_trace(x=df2$Total.expenditure[df2$Teenhome=='1'],y = ~df2$Teenhome[df2$Teenhome=='1'], name = '1 teens',color=I("orange")) %>%
  add_trace(x=df2$Total.expenditure[df2$Teenhome=='2'],y = ~df2$Teenhome[df2$Teenhome=='2'],name = '2 teens',color=I("hotpink")) %>%
  layout(xaxis = list(title = 'Total expenditure')) %>%
  layout(yaxis = list(title = 'Teenagers'))%>%
  layout(title = 'Boxplot for total expenditure based on number of teenagers at home',plot_bgcolor='#e5ecf6', 
       xaxis = list( 
         zerolinecolor = '#ffff', 
         zerolinewidth = 2,
         gridcolor = 'ffff',range=c(0,3000)), 
       yaxis = list( 
         zerolinecolor = '#ffff', 
         zerolinewidth = 2, 
         gridcolor = 'ffff')) -> p2
p2
df20b<-df2[df2$Teenhome==0,]
df21b<-df2[df2$Teenhome==1,]
df22b<-df2[df2$Teenhome==2,]
fig1b <- plot_ly(x = df20b$Total.expenditure, type = 'histogram',color=I("greenyellow"),name="0 teens",opacity=0.5)%>%
  layout(xaxis=list(range=c(0,3000)),yaxis=list(range=c(0,400)) )
fig2b <- plot_ly(x=df21b$Total.expenditure, type ='histogram',color=I("orange"),name="1 teens",opacity=0.5) %>%
  layout(xaxis=list(range=c(0,3000)),yaxis=list(range=c(0,400)) )
fig3b <- plot_ly(x=df22b$Total.expenditure, type='histogram',color=I("hotpink"),name="2 teens",opacity=0.5) %>%
  layout(xaxis=list(range=c(0,3000)),yaxis=list(range=c(0,400)) )
figb <- subplot(fig1b, fig2b, fig3b, nrows = 3) %>% 
  layout(title = list(text = "Histograms for total expenditure based on number of teens at home"),
         plot_bgcolor='#e5ecf6') 
figb

Findings

  1. Customers with two teenagers had the highest median expenditure of $458, as compared to a median expenditure of $443 and $397 by customers with 0 and 1 teens respectively.
  2. The variability in the data was highest for customers with no teens.
  3. Their expenditure ranged from $0 to $2525 and had a standard deviation of $698, as opposed to $483 and $466 for customers with 2 and 1 teens respectively.
  4. Distribution of expenditure for customers was right skewed (mean>median), irrespective of the number of teenagers.
  5. There were greater number of observations (or customers) with lesser expenditure.

Country wise Expenditure

Which country has the highest expenditure? Global marketing focuses a product on the needs of potential buyers in various countries. One of the key aspects in global marketing is knowing the consumer and connecting with the audience. Knowing who needs your product, what form they need it in, and how to market it in a way that strengthens the brand are essential to every business. It is of utmost importance to identify countries where the business’ product might be successful, and then localize the brand to reflect the needs of those communities.

req = aggregate(df2$Total.expenditure, by=list(Category=df2$Country), FUN=sum)
req = as.data.frame.matrix(req) 
req$Category = factor(req$Category, levels = req$Category[order(req$x, decreasing = TRUE)])

req %>% plot_ly(x = ~Category, y = ~x, type = 'bar',opacity=0.5,color = I("lightslategrey")) %>% layout(title = "Barplot for country wise expenditure",
                    xaxis = list(title = "Country"),
                    yaxis = list(title = "Expenditure",range=c(0,700000)))

Findings

Singapore was found to have highest expenditure, followed by Saudi Arabia and Canada. Montenagro has the lowest expenditure. Top 5 countries in terms of expenditure:

  1. Singapore
  2. Saudi Arabia
  3. Canada
  4. Australia
  5. India

Country wise Expenditure on various products

Which country spends the most on each product category? We try to explore how much each country spends on various products, including Wine, Fruits, Sweets, Meat, Gold and Fish.

fig1c <- plot_ly(x = df2$Country, y =df2$MntFishProducts, type = 'bar', opacity=0.5,color = I("greenyellow"),name="Fish")

fig2c <- plot_ly(x = df2$Country, y = df2$MntFruits, type = 'bar', opacity=0.5,color = I("purple"),name="Fruits")%>%
  layout(yaxis=list(range=c(0,30000))) 

fig3c <- plot_ly(x = df2$Country, y=df2$MntGoldProds, type = 'bar', opacity=0.5,color = I("orange"),name="Gold")%>%
  layout(yaxis=list(range=c(0,50000))) 

fig4c <- plot_ly(x =df2$Country, y=df2$MntMeatProducts, type = 'bar', opacity=0.5,color = I("magenta"), name="Meat")%>%
  layout(yaxis=list(range=c(0,200000))) 

fig5c <- plot_ly(x = df2$Country, y=df2$MntSweetProducts, type = 'bar', opacity=0.5,color = I("yellow"),name="Sweets")

fig6c <- plot_ly(x =df2$Country, y=df2$MntWines, type = 'bar', opacity=0.5,color = I("salmon"), name="Wines")%>%
  layout(yaxis=list(range=c(0,400000))) 


figc <- subplot(fig1c, fig2c, fig3c, fig4c, fig5c, fig6c, nrows = 3) %>%
  layout(title = list(text = "Barplots for expenditure of countries on different products"),
         plot_bgcolor='#e5ecf6')
figc
#Grouped barplot
df2$Country <- factor(df2$Country, levels = unique(df2$Country)[order(df2$MntWines, decreasing = FALSE)])
fig1d <- plot_ly(data=df2, x = ~Country, y = ~MntFishProducts, type = 'bar', opacity=0.6,color = I("greenyellow"),name="Fish")
fig2d <- fig1d %>% add_trace(y = ~MntFruits, name = 'Fruits', opacity=0.6,color = I("purple"))
fig3d <- fig2d %>% add_trace(y = ~MntGoldProds, name = 'Gold', opacity=0.6,color = I("orange"))
fig4d <- fig3d %>% add_trace(y = ~MntMeatProducts, name = 'Meat', opacity=0.6,color = I("magenta"))
fig5d <- fig4d %>% add_trace(y = ~MntSweetProducts, name = 'Sweets', opacity=0.6,color = I("yellow"))
fig6d <- fig5d %>% add_trace(y = ~MntWines, name = 'Wines', opacity=0.6,color = I("salmon"))
fig7d <- fig6d %>% layout(title = list(text = "Grouped Barplot for expenditure of countries on different products"),
                          plot_bgcolor='#e5ecf6',yaxis = list(title = 'Expenditure',range=c(0,350000)), barmode = 'grouped')
fig7d

Findings

  1. Singapore had the highest expenditure for all product categories (fish, fruits, wines, meat, sweets and gold).
  2. Each country spends the most on wines and meat as compared to other product categories.
  3. Most countries spent the least on Sweets, followed by fruits.

Level of Education and marital status of customers

It is imperative to understand the distribution of different educational status and marital statuses of customers to examine if there is a group that dominates the dataset.

#Pie chart for education status and marital status
fig <- plot_ly(mai = c(0, 0, 0, 0),marker = list(colors=c("peachpuff","lightsteelblue", "pink","paleturquoise", "lightgray","plum","mediumslateblue", "gray","lavenderblush")))
fig <- fig %>% add_pie(data = count(df2, Education), labels = ~Education,
                       values = ~n,name = "Education",domain = list(x = c(0, 0.4), y = c(0.2, 1)))
fig <- fig %>% add_pie(data = count(df2, Marital_Status), labels = ~Marital_Status,
                       values = ~n, name = "Marital_Status", domain = list(x = c(0.6, 1), y = c(0.2, 1)))
fig <- fig %>% layout(title = "Pie Charts for distribution of data based on Education and Marital Status", showlegend = T,strip.white=FALSE,xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE), yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig

Findings

  1. Majority of the customers were undergraduates.
  2. Over half the customers (51.6%) were undergraduates.
  3. Approximately one quarter of the customers had completed their Masters and an even smaller percentage (22.3%) were PhDs.
  4. Most consumers were Married (38.7%), followed by people who were in a relationship (25.8%) and single consumers (21%).

Number of Web, In store purchases and Education status

Customers of which education level spent the most in store vs on the web? Though we live in a digital world, In store shopping is still very much alive. Despite online shopping becoming much more prevalent, a vast number of shoppers still make purchases in-store. Whether customers are shopping in-store or online, each type offers its own benefits. It is important to make sure and analyze if this shift happens based on the education status of the customer or not. Utilizing this target demographic analysis will help the business save time and money by discovering if the customer is more likely to make a purchase online or in a store.

dfsd1 <-unique(df2$Education)
dfsd2<-df2[df2$Education== 'Under.graduate',]
dfsd3<-df2[df2$Education=='Master',]
dfsd4<-df2[df2$Education=='PhD',]

Education <- c("Under.graduate", "Master", "PhD")
NumWebPurchases <- c(sum(dfsd2$NumWebPurchases), sum(dfsd3$NumWebPurchases), sum(dfsd4$NumWebPurchases))
NumStorePurchases <- c(sum(dfsd2$NumStorePurchases), sum(dfsd3$NumStorePurchases) , sum(dfsd4$NumStorePurchases))
df <- data.frame(Education, NumWebPurchases, NumStorePurchases)
df$Education <- factor(df$Education, levels = unique(df$Education)[order(df$NumStorePurchase, decreasing = FALSE)])
p <- plot_ly(df, x = ~Education, y = ~NumWebPurchases, type = 'bar', name = 'Number of WebPurchases',color=I("lightpink")) %>%
  add_trace(y = ~NumStorePurchases, name = 'Num of StorePurchases',opacity=0.4,color=I('greenyellow')) %>%
  layout(yaxis = list(title = 'Count'), barmode = 'group')
p

Findings

  1. Undergraduates made the highest number of purchases online and in store.
  2. All customers, irrespective of education level, purchased more in stores as compared to online.

Number of Web, In store purchases based on Marital Status

Customers of what marital status spent the most in store vs on the web?

This analysis examines whether the demographic profile of a customer affects their attitude towards shopping behavior. The demographic profile variable that could be linked to attitude was chosen as marital status. The answer to this question would provide some understanding to service providers and businesses on the effect of demographic profile on online and in store shopping. This would help them in finding and implementing suitable strategies to enhance targeted shopping experience.

dfms1 <-unique(df2$Marital_Status)
dfms2<-df2[df2$Marital_Status== 'Divorced',]
dfms3<-df2[df2$Marital_Status=='Single',]
dfms4<-df2[df2$Marital_Status=='Married',]
dfms5<-df2[df2$Marital_Status=='Together',]
dfms6<-df2[df2$Marital_Status=='Widow',]
dfms7<-df2[df2$Marital_Status=='YOLO',]
dfms8<-df2[df2$Marital_Status=='Alone',]
dfms9<-df2[df2$Marital_Status=='Absurd',]

Marital_Status <- c("Divorced", "Single" , " Married" , "Together","Widow" , "YOLO" , "Alone" , "Absurd")
NumWebPurchasesms <- c(sum(dfms2$NumWebPurchases), sum(dfms3$NumWebPurchases),sum(dfms4$NumWebPurchases) ,sum(dfms5$NumWebPurchases) ,sum(dfms6$NumWebPurchases) ,sum(dfms7$NumWebPurchases),sum(dfms8$NumWebPurchases),sum(dfms9$NumWebPurchases))
NumStorePurchasesms <- c(sum(dfms2$NumStorePurchases), sum(dfms3$NumStorePurchases) , sum(dfms4$NumStorePurchases),sum(dfms5$NumStorePurchases) ,sum(dfms6$NumStorePurchases),sum(dfms7$NumStorePurchases),sum(dfms8$NumStorePurchases),sum(dfms9$NumStorePurchases))
dfMS <- data.frame(Marital_Status, NumWebPurchasesms, NumStorePurchasesms)
dfMS$Marital_Status <- factor(dfMS$Marital_Status, levels = unique(dfMS$Marital_Status)[order(dfMS$NumStorePurchasesms, decreasing = FALSE)])
p_marr_status <- plot_ly(dfMS, x = ~Marital_Status, y = ~NumWebPurchasesms, type = 'bar', name = 'Number of WebPurchases',color=I("darkgray")) %>%
  add_trace(y = ~NumStorePurchasesms, name = 'Num of StorePurchases',opacity=0.6,color=I("salmon")) %>%
  layout(yaxis = list(title = 'Count'), barmode = 'group')
p_marr_status

Findings

  1. People who were married made the most number of purchases, followed by those who were in a relationship and single people.
  2. Most customers, irrespective of their marital status, made a higher number of purchases in stores.
  3. Absurd, alone and YOLO customers barely made any purchases.

Sampling Techniques

Sampling is a technique used to select a subset of the population to make statistical inferences from them and draw conclusions about the whole population. For example, if a company would like to know the customer preferences for an entire country, it is not plausible to conduct a research study that involves everyone. Thus a representative and unbiased sample of this population is used instead. There are different Sampling techniques that can be used. For this analysis, we have used Simple Random Sampling with Replacement, Simple Random Sampling without Replacement, Systematic Sampling and Stratified Sampling. We have demonstrated the use of these techniques by plotting a histogram for the Income feature (Sample size was taken as 80).

Simple Random Sampling

In a simple random sampling (SRS), every item of the frame (list of items that define a population) has an equal chance of being selected in the sample. Samples can be chosen with or without replacement. Here ‘n’ represents the sample size and ‘N’ represents number of items in the frame. The probability of selecting each item in SRS with replacement is 1/N. In case of SRS without replacement, the probability of selecting the first element is 1/N, second element is 1/N-1, and so on.

Systematic Sampling

In Systematic sampling, the frame is divided into groups.If ‘n’ is the sample size and ‘N’ is the size of the frame, each group has N/n=k items.Systematic sampling involves a random start (i.e. item from the first group of k elements is randomly selected). After this, every kth element from the frame is selected to be a part of the sample. We can also compute the probability for each item to be included in the sample with probabilities proportional to the size. Here we compute probabilities based on total expenditure of each customer.

Stratified Sampling

In Stratified Sampling, the frame is divided into subgroups based on distinct characteristics of the data. These subgroups are called ‘strata’. Each strata is then sampled as an independent sub-population, out of which individual elements can be randomly selected and then combined in order to form the sample.

#Sampling
library("sampling")
#Plotting histogram for original data
set.seed(123)
fig8 <- plot_ly(x = df2$Income, type = "histogram",histnorm = "probability",color=I("darkgray"),name="Histogram for all data",opacity=0.8)%>% layout( xaxis=list(zerolinecolor = '#fff',zerolinewidth = 2,gridcolor = 'ffff',range=c(0,180000),title="Income"),yaxis=list(range=c(0,0.040),title="Density")) 
fig8%>%layout(title="Histogram for all data")
#Simple Random Sampling with replacement
set.seed(123)
N <- nrow(df2)
n <- 80
s<-srswr(n,N)
#s
#s[s!=0]
rows<-rep((1:nrow(df2))[s!=0],s[s!=0])
srs<-df2[rows,]
fig8a <- plot_ly(x = srs$Income, type = "histogram",histnorm = "probability",color=I("greenyellow"),name="Srswr",opacity=0.5)%>% layout(xaxis=list(zerolinecolor = '#fff',zerolinewidth = 2,gridcolor = 'ffff',range=c(0,180000),title="Income"),yaxis=list(range=c(0,0.25),title="Density")) 


#Simple Random Sampling without replacement
set.seed(123)
s2<-srswor(n,N)
#s2
#s2[s2!=0]
rows2<-(1:nrow(df2))[s2!=0]
srs2<-df2[rows2,]
fig8b<- plot_ly(x = srs2$Income, type = "histogram",histnorm = "probability",color=I("orange"),name="Srswor",opacity=0.8)%>% layout(xaxis=list(zerolinecolor = '#fff',zerolinewidth = 2,gridcolor = 'ffff',title="Income",range=c(0,180000)),yaxis=list(range=c(0.25),title="Density")) 


#Systematic Sampling
set.seed(123)
k<-ceiling(N/n)
#k
r<-sample(k,1)
#r
sys<-seq(r,by=k,length=n)
#sys
s3<-df2[sys,]
#s3
fig8c <- plot_ly(x = s3$Income, type = "histogram",histnorm = "probability",color=I("pink"),opacity=0.5,name="Systematic Sampling")%>% layout(xaxis=list(zerolinecolor = '#fff',zerolinewidth = 2,gridcolor = 'ffff',title="Income",range=c(0,180000)),yaxis=list(title="Density",range=c(0,0.25))) 

#Systematic Sampling (Unequal probabilities)
set.seed(123)
pick<-inclusionprobabilities(df2$Total.expenditure,80)
s3b<-UPsystematic(pick)
s3c<-df2[s3b!=0,]
fig8c2 <- plot_ly(x = s3c$Income, type = "histogram",histnorm = "probability",color=I("blue"),opacity=0.5,name="Systematic Sampling (unequal prob)")%>% layout(xaxis=list(zerolinecolor = '#fff',zerolinewidth = 2,gridcolor = 'ffff',title="Income",range=c(0,180000)),yaxis=list(range=c(0.25),title="Density")) 

#Stratified Sampling
set.seed(123)
df2$Education<-factor(df2$Education,ordered=T,levels=c('Under.graduate','Master','PhD'))
freq<-table(df2$Education)
st.sizes<-80*(freq/sum(freq))  #calculating sample sizes for unequal strata
#st.sizes
s4<-sampling::strata(df2,stratanames=c("Education"),method="srswor",size=st.sizes)
#s4
d<-getdata(df2,s4)
fig8d <- plot_ly(x = d$Income, type = "histogram",histnorm = "probability",color=I("magenta"),name="Stratified sampling",opacity=0.5)%>% layout(xaxis=list(zerolinecolor = '#fff',zerolinewidth = 2,gridcolor = 'ffff',title="Income",range=c(0,180000)),yaxis=list(title="Density",range=c(0,0.25)))

s <- subplot(fig8a, fig8b, fig8c, fig8c2, fig8d, nrows = 3, shareY = TRUE)%>%layout(title='Sampling techniques')
s
mean(df2$Income) #population mean
mean(srs$Income) #srswr
mean(srs2$Income) #srswor
mean(s3$Income) #systematic 
mean(s3c$Income) #systematic with unequal probabilities
mean(d$Income) #stratified

sd(df2$Income) #population standard deviation
sd(srs$Income) #srswr
sd(srs2$Income) #srswor
sd(s3$Income) #systematic 
sd(s3c$Income) #systematic with unequal probabilities
sd(d$Income) #stratified

Findings

The mean of the samples extracted using the various sampling techniques was found to be very close to the population mean, except for systematic sampling with unequal probabilities. Standard deviation of the samples is also close to the population’s standard deviation, except in the case of systematic sampling. Hence we can use these samples in order to draw inferences about our population.

Linear Regression Model for Expenditure and Income

Is there a relationship between a customer’s income and their expenditure?

The relationship between income and expenditure is often called a consumption schedule. It is used to describe economic trends in the household sector. When there is more money or anticipation of income, more goods are purchased by consumers. Meaning money is spent on expenditures, at times, even if there isn’t enough income to cover them. This is a common economic principal used to describe spending trends for national and world economies. A business should consider the relationship between consumption and savings to extract data on buyer trends within its own industry. Hence, this question will help to make a business decision about the relationship of expenditure vs income for the given dataset.

Linear Regression Model

Using SLR, we assert a straight line on the scatterplot that represents the best fitting line to the data that captures the pattern of the relationship.We build a data model that allows us to:

  1. Quantify the relationship between the response variable and the explanatory variable
  2. Predict the response of a new observation with a given value for the explanatory variable.

The equation for the simple linear regression line is given by y = β0 + β1x, where y is the response or dependent variable, x is the explanatory or independent variable, β0 is the intercept (the value of y when x = 0), β1 is the slope (the expected change in y for each one-unit change in x).The most common way to find the best fit line is to minimize the sum of the squares of the distances between the points and the regression line. This approach is called the least-squares method.

pal <- c("peachpuff","deeppink","darkmagenta","blueviolet","yellow","darkolivegreen")
dfexp <- df2[,c(29,5,30)]
cor(df2$Total.expenditure,df2$Income) #r
## [1] 0.7903145
mdl<-lm(df2$Total.expenditure~df2$Income)
summary(mdl) #R2=0.6244
## 
## Call:
## lm(formula = df2$Total.expenditure ~ df2$Income)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2986.6  -225.7   -40.5   208.0  2244.4 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.696e+02  2.141e+01  -26.60   <2e-16 ***
## df2$Income   2.256e-02  3.766e-04   59.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 370.5 on 2156 degrees of freedom
## Multiple R-squared:  0.6246, Adjusted R-squared:  0.6244 
## F-statistic:  3587 on 1 and 2156 DF,  p-value: < 2.2e-16
dfexp %>%
  plot_ly(x = ~Income, y = ~Total.expenditure,colors = pal) %>%
  add_markers(color = ~Age) %>%
  add_lines(x = ~Income, y = fitted(mdl))
fitted(mdl) #predicted values
resid(mdl) #residuals of the model (difference between actual and predicted values)
rmse<-sqrt(mean(sum((df2$Total.expenditure-fitted(mdl))^2)))
cat("RMSE is", rmse)
## RMSE is 17202.95
mae<-mean(abs(df2$Total.expenditure-fitted(mdl)))
cat("MAE is", mae) #mean absolute error
## MAE is 273.2538

Findings

We created a linear model to predict total expenditure based on their income. Here income was taken as the independent or explanatory variable and total expenditure was the response or dependent variable.

As income increases, expenditure was seen to increase as well. Curved/ quadratic relationship was observed from the scatterplot. Looking at the scatterplot,

  1. Form: Curved relation
  2. Direction: positive association
  3. Strength: moderately strong.

Correlation coefficient, r was found to be 0.79. Thus we can say that there is a positive association, since correlation coefficient > 0.Also 0.79 is closer to 1, hence explanatory and response variables are decently correlated, association is moderately strong.

On creating a linear regression model, we found that 62.4% of the variation in Total expenditure is explained by income. Coefficient B1=0.0225 and intercept B0=-569.2.

On a $1 increase in income, total expenditure increases by $0.0225. Root mean square error is 17202 and Mean absolute error is 273.25.

Linear equation: y= -569.2 + 0.0225x, where y=total expenditure and x=income. Applying transformations on the data and using a generalised data model could help improve these results.

Hypothesis test to check for linear relation between income and expenditure

Step 1: Stating the hypothesis and selecting alpha level

H0 (null hypothesis): B1=0 (no linear association between Income and Total expenditure)

Ha (alternate hypothesis): B1!=0 (there is a linear association between x and y) alpha or significance level is 0.05.

Step 2: Decide test statistic

One sample two sided t test.

t=B1/(sqrt(sum(sq(actual y-predicted y))/n-2) / sqrt(sum(sq(actual x-mean x))))

degree.of.freedom<-nrow(df)-2 degree.of.freedom [1]2238

Step 3: Decision rule

If |t|>=critical t value, reject null hypothesis. Otherwise, do not reject the null hypothesis.

crit.t <- qt(0.975,degree.of.freedom) #1.96

Decision rule: Reject H0 if |t|>= 1.96, or else do not reject.

Step 4: Compute test statistic

summary(mdl) t= 59.78

Step 5: Conclusion

As t > critical value of t, reject null hypothesis.

We have sufficient evidence at alpha=0.05 level that B1!=0. There is a significant linear association between income and total expenditure.

Confidence intervals:

confint(mdl,level=0.95)
##                    2.5 %        97.5 %
## (Intercept) -611.5544338 -527.57826113
## df2$Income     0.0218181    0.02329524

We can say with 95% confidence that true value of B1 lies between 0.021 and 0.023.

Central Limit Theorem

The Central Limit Theorem states that the distribution of the sample means for a given sample size of the population has the shape of a normal distribution. The mean of the sampling distribution is equal to the population mean and standard deviation is equal to the population standard deviation/ sqrt(n), where n is the sample size. As the sample size is increased, standard deviation decreases. Using attribute Age from our dataset, we show the application of Central Limit Theorem. Shown below are histograms for sample means of 250 random samples of sample size 10, 20, 30 and 40.

#hist(df2$Age)
agedf <- data.frame(df2$Age)
#agedf
sample23 <- 250

xbars1 = numeric(250)
xbars2 = numeric(250)
xbars3 = numeric(250)
xbars4 = numeric(250)
for (i in 1:250) {
  xbars1[i] <- mean(sample(df2$Age, 
                           size = 10, replace = TRUE))
}
agedf1 <- data.frame(xbars1)
#agedf1
p1 <- plot_ly(agedf1, x= ~agedf1$xbars1, type = 'histogram',histnorm = 'probability',color=I("blue"),opacity=0.4,name="Sample size:10")%>%layout(xaxis=list(range=c(40,65)),yaxis=list(title="Density",range=c(0,0.15)))

for (i in 1:250) {
  xbars2[i] <- mean(sample(df2$Age, 
                           size = 20, replace = TRUE))
}
agedf2 <- data.frame(xbars2)
p2 <- plot_ly(agedf2, x= ~agedf2$xbars2, type = 'histogram',histnorm = 'probability',color=I("salmon"),name="Sample size:20")%>%layout(xaxis=list(range=c(40,65)),yaxis=list(range=c(0,0.15)))


for (i in 1:250) {
  xbars3[i] <- mean(sample(df2$Age, 
                           size = 30, replace = TRUE))
}
agedf3 <- data.frame(xbars3)
p3 <- plot_ly(agedf3, x= ~agedf3$xbars3, type = 'histogram',histnorm = 'probability',color=I("greenyellow"),opacity=0.6, name="Sample size:30")%>%layout(xaxis=list(range=c(40,65)),yaxis=list(title="Density",range=c(0,0.15)))


for (i in 1:250) {
  xbars4[i] <- mean(sample(df2$Age, 
                           size = 40, replace = TRUE))
}
agedf4 <- data.frame(xbars4)
p4 <- plot_ly(agedf4, x= ~agedf4$xbars4, type = 'histogram',color=I("pink"),histnorm = 'probability',name="Sample size:40")%>%layout(xaxis=list(range=c(40,65)),yaxis=list(range=c(0,0.15)))

p <- subplot(p1, p2, p3, p4, nrows = 2, shareY = TRUE)%>%layout(title="Central Limit Theorem")
p
cat("Sample Size = ", 10, " Mean = ", mean(xbars1),
    " SD = ", sd(xbars1))
## Sample Size =  10  Mean =  52.2988  SD =  3.969011
cat("Sample Size = ", 20, " Mean = ", mean(xbars2),
    " SD = ", sd(xbars2))
## Sample Size =  20  Mean =  52.1824  SD =  2.488106
cat("Sample Size = ", 30, " Mean = ", mean(xbars3),
    " SD = ", sd(xbars3))
## Sample Size =  30  Mean =  52.26093  SD =  2.003406
cat("Sample Size = ", 40, " Mean = ", mean(xbars4),
    " SD = ", sd(xbars4))
## Sample Size =  40  Mean =  52.4402  SD =  1.914103

As we can see, the mean of the sampling distribution remains almost the same as the overall population mean, whereas the standard deviation decreases as sample size is increased.

Number of Deals and Age

“The proof is in the purchase,” as the saying goes.

Purchase-based targeting (PBT) goes after consumers who have a purchase history with your offering. Statistics show that those who bought once are more likely to buy again. PBT is highly productive and so this question will help to know people of what age have opted for more deals than others and this will ultimately help strategize PBT of consumers.

plot_ly(df2,x=df2$NumDealsPurchases,y = ~df2$Age, type="box",opacity=0.6,color=I("darkcyan"))%>%layout(title="Boxplot for Number of deals and Age of Customer", xaxis=list(title="Number of deals",range=c(0,16)),yaxis=list(title="Age"))
df2 %>%
  plot_ly(x = ~Age, y = ~NumDealsPurchases,opacity=0.6,color=I("darkcyan")) %>%
  add_markers(size = ~Income) %>% layout(title="Number of deals and Age",yaxis=list(title="Number of deals",range=c(0,16)),xaxis=list(title="Age"))

Findings

  1. Most of the customers went for 1 to 4 deals.
  2. Very few people went for over 10 deals.
  3. Consumers who went for 13 ore more deals usually had very high or very low income.

Number of kids, teens and Web purchases

#Stacked bar plot for kids and teens and the no of web purchases

g1 <- plot_ly(df2, x = ~NumWebPurchases, y = ~Teenhome, color=I("lightslategrey"),type = 'bar',opacity=0.6, name = 'Teens') %>%
  add_trace(y = ~Kidhome, color=I("grey"),opacity=0.5,name = 'Kids') %>%
  layout(title="Teens,kids and web purchases",yaxis = list(title = 'Teens/Kids'), barmode = 'stack',xaxis=list(title="Web purchases",range=c(0,14000)))
g1

Findings

  1. People with no kids made the highest number of web purchases.
  2. However, people with 1 teenager made the highest number of web purchases.
  3. People with 2 kids or 2 teenagers made the least number of web purchases.

Number of kids, teens and In store purchases

#Stacked bar plot for kids and teens on the no of in store purchases
s1 <- plot_ly(df2, x = ~NumStorePurchases, color=I("lightslategrey"),opacity=0.6, y = ~Teenhome, type = 'bar', name = 'Teens') %>%
  add_trace(y = ~Kidhome, color=I("grey"),opacity=0.5,name = 'Kids') %>%
  layout(title="Teens,kids and in store purchases",yaxis = list(title = 'Teens/Kids'), barmode = 'stack',xaxis=list(title="In store purchases",range=c(0,20000)))
s1
df2%>%group_by(Kidhome)%>%summarise(sum(NumWebPurchases))
df2%>%group_by(Kidhome)%>%summarise(sum(NumStorePurchases))
df2%>%group_by(Teenhome)%>%summarise(sum(NumWebPurchases))
df2%>%group_by(Teenhome)%>%summarise(sum(NumStorePurchases))

Findings

  1. People with no kids made the highest number of in store purchases.
  2. Also, people with no teenagers had the highest number of in store purchases.
  3. Similar to the case of online shopping, people with 2 kids or 2 teenagers made the least number of in store purchases.
  4. Number of in store purchases were higher than online purchases.

Types of purchases based on number of kids and number of teenagers.

#All kinds of Purchases based on number of kids as a Grouped Bar Chart
fig1d <- plot_ly(data=df2, x = ~Kidhome, y = ~NumStorePurchases, type = 'bar', opacity=0.5,color = I("greenyellow"),name="No of In Store Purchases")
fig2d <- fig1d %>% add_trace(y = ~NumWebPurchases, name = 'Web Purchases', opacity=0.5,color = I("magenta"))
fig3d <- fig2d %>% add_trace(y = ~NumDealsPurchases, name = 'Deals Purchased', opacity=0.6,color = I("orange"))
fig4d <- fig3d %>% add_trace(y = ~NumCatalogPurchases, name = 'Catalog Purchases', opacity=0.5,color = I("salmon"))
fig5d <- fig4d %>% add_trace(y = ~NumWebVisitsMonth, name = 'Web Visits/Month', opacity=0.4,color = I("blue"))
fig6d <- fig5d %>% layout(title="Types of purchases based on number of kids",yaxis = list(title = 'Purchases',range=c(0,30000)),xaxis=list(title="Kids"), barmode = 'stack')
fig6d
#All kinds of Purchases based on number of teens as a stacked bar chart
fig1d <- plot_ly(data=df2, x = ~Teenhome, y = ~NumStorePurchases, type = 'bar', opacity=0.5,color = I("greenyellow"),name="No of In Store Purchases")
fig2d <- fig1d %>% add_trace(y = ~NumWebPurchases, name = 'Web Purchases', opacity=0.5,color = I("magenta"))
fig3d <- fig2d %>% add_trace(y = ~NumDealsPurchases, name = 'Deals Purchased', opacity=0.6,color = I("orange"))
fig4d <- fig3d %>% add_trace(y = ~NumCatalogPurchases, name = 'Catalog Purchases', opacity=0.5,color = I("salmon"))
fig5d <- fig4d %>% add_trace(y = ~NumWebVisitsMonth, name = 'Web Visits/Month', opacity=0.4,color = I("blue"))
fig6d <- fig5d %>% layout(title="Types of purchases based on number of teens",yaxis = list(title = 'Purchases',range=c(0,25000)), barmode = 'stack',xaxis=list(title="Teens"))
fig6d

Findings

Top 3 categories of purchases include:

  1. In store
  2. Web or online
  3. Monthly Web visits

Inferences Based on Purchase: 1. Investment in deals and catalog purchases was comparatively scarce. 2. Highest number of In store and catalog purchases were made by customers with no teens. 3. Highest number of Web and Deal purchases as well as web visits were made by customers with one teen. 4. Highest number of In store, catalog and web purchases were made by customers with no kids. 5. Highest number of monthly Web visits were made by customers with one kid. 6. Customers with two kids or two teens made the least number of purchases.

Number of Web visits per month

Our data spanned across 2.5 years (from 2012 to 2014). For 2012 and 2014, we only had data for the second and first half of the year respectively. We separated the date column to get year, month and day values and filtered it to get year wise data. Subsequently, we used group_by and summarise operations to get the sum of all web visits per month. We then arranged this data in ascending order based on months. Our aim was to see if there was any trend in web visits across months for three years.

df2 %>%
  separate(Dt_Customer, sep="/", into = c("month","day","year"))->df3
df3%>%filter(year=="12")%>%group_by(month)%>%summarise(Webvisits=sum(NumWebVisitsMonth))->a
a$month<-as.numeric(a$month)
a<-a%>%arrange(month)
df3%>%filter(year=="13")%>%group_by(month)%>%summarise(Webvisits=sum(NumWebVisitsMonth))->b
b$month<-as.numeric(b$month)
b<-b%>%arrange(month)
df3%>%filter(year=="14")%>%group_by(month)%>%summarise(Webvisits=sum(NumWebVisitsMonth))->c
c$month<-as.numeric(c$month)
c<-c%>%arrange(month)
fig9 <- plot_ly(a, x = ~month, y = ~Webvisits, type = 'scatter', name="Year 2012",mode = 'lines',line=list(color="greenyellow",width=4))%>%layout(xaxis=list(range=c(1,12)),yaxis=list(range=c(0,800)))
fig9b <- plot_ly(b, x = ~month, y = ~Webvisits, type = 'scatter', name="Year 2013", mode= 'lines',line=list(color="orange",width=4))%>%layout(xaxis=list(range=c(1,12)),yaxis=list(range=c(0,800)))
fig9c<-plot_ly(c, x = ~month, y = ~Webvisits, type = 'scatter', name="Year 2014", mode = 'lines',line=list(color="pink",width=4))%>%layout(xaxis=list(range=c(1,12)),yaxis=list(range=c(0,800)))
fig9d <- subplot(fig9, fig9b, fig9c, nrows = 3) %>% 
  layout(title = list(text = "Lineplots for number of web visits per month"),
         plot_bgcolor='#e5ecf6') 
fig9d

Findings

  1. In general, web visits peaked in the month of March, followed by fewer visits till June.
  2. In the second half of the year, web visits were comparatively higher.